Abstract

In this paper we introduce a statistical inference framework for estimating the contagion source
from a partially observed contagion spreading process on an arbitrary network structure. The
framework is based on maximum likelihood estimation of a partial epidemic realization and
involves large-scale simulation of contagion spreading processes from the set of potential source
locations. We present a number of different likelihood estimators that are used to determine
the conditional probabilities associated with observing a partial epidemic realization given
particular source location candidates. This statistical inference framework is also applicable to
arbitrary compartment contagion spreading processes on networks. We compare the estimation
accuracy of these approaches in a number of computational experiments performed with the SIR
(susceptible-infected-recovered), SI (susceptible-infected) and ISS (ignorant-spreading-stifler)
contagion spreading models on synthetic and real-world complex networks.

The structure of the vast majority of biological networks (biochemical, ecological), technological
networks (internet, transportation, power grids), social networks and information networks
(citation, WWW) can be represented by complex networks [6], [7], [3]. Epidemic or contagion
processes are among the most prevalent types of dynamic processes of interest on these
real-life complex networks, and they include disease epidemics, computer virus spreading, and
information and rumor propagation [23]. Different mathematical frameworks have been used to
study epidemic spreading. We can divide them into two major categories based on the assumptions
they make: the homogeneous mixing framework and the heterogeneous mixing framework. The
homogeneous mixing framework assumes that all individuals in a population have an equal
probability of contact. This is a traditional mathematical framework [12], [10] in which differential
equations are used to model epidemic dynamics. The heterogeneous mixing framework assumes
that properties of contact interactions among individuals are defined via some underlying
network structure. The small-world network property [24] and the scale-free network property [2],
[5] have a great impact on the outcome of epidemic spreading. We can further divide the
heterogeneous mixing framework by other assumptions: the bond percolation, the mean-field and
the particle network frameworks. The bond percolation approach applies percolation theory
to describe epidemic processes on networks [14], [11]. The mean-field approach assumes that all
nodes having the same degree are statistically equivalent with respect to an epidemic process [4],
[17]. The particle network approach assumes that the spreading process is characterized by particles
which diffuse along edges of a transportation network, and that each node contains some
non-negative integer number of particles (reaction-diffusion processes).

The main question we address in this work is: is it possible to detect the location of the
initial source from partial information on the contagion spread over a network structure? This
research question is relevant for many realistic scenarios in which we observe an epidemic spread
at a certain moment in time and would like to infer the source location (patient zero). Our
statistical inference framework is applicable to an arbitrary compartment contagion spreading
model on an arbitrary network structure. We have based our main case study on the SIR
(susceptible-infected-recovered) model, but we also demonstrate the applicability of the inference
framework to other contagion processes such as the SI (susceptible-infected) and the ISS
(ignorant-spreading-stifler) models. The SIR model [12] is an adequate model for many contagious
processes such as disease spreading, virus propagation [20] or rumour propagation [15]. We base
our inference study on rather general assumptions which can be relaxed: (i) that the observed
partial epidemic realization is defined by complete knowledge of the infected and recovered nodes,
and (ii) that the probabilities of infection and recovery of the underlying epidemic process are
known in advance, as is the time from the start of the epidemic. We empirically demonstrate the
inference performance of the framework on different types of networks and for different contagion
properties. We also investigate the impact on the performance of the framework when the
assumptions are relaxed, i.e. when knowledge of the network status is incomplete and when the
contagion parameters and time are uncertain. Finally, we demonstrate the generality of the
approach by solving the source detection problem for different compartment models (SIR, SI and ISS).

Recently, the problem of estimating the initial source has gained a lot of attention due to its
importance and practical aspects. Under different assumptions on network structures or spreading
processes, different source estimators have been developed [21], [16], [25], [18], [13]. Our
contribution is to address source detection for more general spreading processes on arbitrary
network structures. In this work we cast this problem into a statistical inference framework based
on maximum likelihood estimation of the source of an observed epidemic realization. This inference
framework relies on large-scale simulation of contagion spreading processes from the set of
potential source locations and on subgraph similarity measures.

In Section 1 we describe the SIR compartment model. In Section 2 we describe our statistical
inference framework and define the different maximum likelihood estimators and subgraph similarity
measures used to infer the conditional probability of epidemic realizations from particular source
locations. In Section 3 we describe experiments that demonstrate network, contagion dynamics
and noise effects, and Section 4 discusses related work.



Notation

Notation                                Description
--------------------------------------  ----------------------------------------------------------
$G$                                     a network with a set of nodes $V$ and a set of edges $E$
$\Theta$                                general variable which identifies source nodes
$\theta_i$                              specific value of the variable $\Theta$; example: $\Theta = \theta_i$,
                                        the source node is the node $i$
$p$                                     probability of infection in one discrete time step
$q$                                     probability of recovery in one discrete time step
$n$                                     number of simulations for a specific SIR process
$T$                                     temporal threshold (random variable or constant)
$\vec{R}$                               epidemic random vector $\vec{R} = (R(1), R(2), \ldots, R(N))$
$R(i)$                                  Bernoulli indicator random variable for node $i$
$\vec{r}$                               epidemic realization, example $\vec{r}_1 = (1, 0, 0, 1, \ldots, 1)$
$\vec{r}_*$                             observed epidemic realization
$\vec{r}(i)$                            i-th component of the realization vector $\vec{r}$, example
                                        $\vec{r} = (1, 0, 1, 1)$, $\vec{r}(2) = 0$
$\vec{R}_\theta$                        random vector for realizations from node $\theta$
$\vec{R}_{\theta,i}$                    i-th sample realization vector from random vector $\vec{R}_\theta$
$S$                                     set of potential sources
$\varphi(\vec{r}_1, \vec{r}_2)$         similarity measure between two realizations $\vec{r}_1$, $\vec{r}_2$
$\varphi(\vec{r}_*, \vec{R}_\theta)$    random variable which measures the similarity between
                                        realization $\vec{r}_*$ and realizations from random vector $\vec{R}_\theta$
$\psi_{\odot}(m_1, m_2)$                a bitwise XNOR function
$\psi_{\vee}(m_1, m_2)$                 a bitwise OR function
$\psi_{\wedge}(m_1, m_2)$               a bitwise AND function
$\varphi_X(\vec{r}_1, \vec{r}_2)$       the similarity calculated with the $XNOR(\vec{r}_1, \vec{r}_2)$ function
$\varphi_J(\vec{r}_1, \vec{r}_2)$       the similarity calculated with the $Jaccard(\vec{r}_1, \vec{r}_2)$ function
$\delta(x)$                             the Dirac delta function



1. SIR compartment model 



We define the contact network as an undirected and unweighted graph $G(V, E)$ ($V$: set of
nodes or vertices, $E$: set of links). A link $(u, v)$ exists only if the two nodes $u$ and $v$ are
in contact during the epidemic time. We also assume that the contact network is static during the
epidemic process. To simulate epidemic propagation through a contact network, we use the standard
stochastic SIR model. In this model each node at any time is in one of the following states:
susceptible (S), infected (I) or recovered (R). The spreading process is simulated using a discrete
time step model.

The SIR epidemic process is a stochastic process, which is simulated with $n$ mutually
independent simulations on the contact network $G$. At the beginning of each epidemic simulation
all nodes of graph $G$ are in the susceptible state, except the set of nodes which are initially
infected. We assume in our treatment that the epidemic parameters $p$ and $q$ are predefined,
constant and known beforehand. The epidemic parameter $p$ is the probability that an infected node
$u$ infects an adjacent susceptible node $v$ in one discrete time step. The epidemic parameter $q$
is the probability that an infected node recovers in one discrete time step. The set of initially
infected nodes is denoted by $\Theta$. At the end of one full epidemic simulation, each node is in
one of two states: susceptible or recovered. In our treatment, however, we limit epidemic spreading
to a predefined number of discrete time steps, which means that we deal with partially realized
epidemic spreads, and this number of steps is also a known parameter in the inference
procedure for the source location estimation.
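
As an illustration only, the discrete-time SIR process described above could be sketched in Python as follows. The adjacency-dictionary representation, the function name `simulate_sir`, and the ordering of events within a time step (infection attempts first, then the recovery draw) are assumptions of this sketch; the text does not specify them.

```python
import random


def simulate_sir(adjacency, source, p, q, T, rng=None):
    """Run one discrete-time SIR simulation for at most T steps.

    adjacency: dict mapping each node to an iterable of its neighbours
               (hypothetical representation of the contact network G).
    Returns the set of nodes that were infected at any time up to step T
    (currently infected plus recovered).
    """
    rng = rng or random.Random()
    infected = {source}
    recovered = set()
    for _ in range(T):
        newly_infected = set()
        newly_recovered = set()
        for u in infected:
            # Each infected node tries to infect every susceptible neighbour
            # with probability p.
            for v in adjacency[u]:
                if v not in infected and v not in recovered and rng.random() < p:
                    newly_infected.add(v)
            # Assumption: the recovery draw happens after the infection
            # attempts of the same time step.
            if rng.random() < q:
                newly_recovered.add(u)
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
        if not infected:
            break  # the epidemic died out before the threshold T
    return infected | recovered
```

With `p = 1` and `q = 0` the process is deterministic, which makes the sketch easy to check by hand on a small path graph.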

2. Statistical inference on epidemic propagation realizations 

In this section we formulate the problem of source localization in the network and develop
the related statistical inference framework.

Epidemic source location problem

Let us define the random vector $\vec{R} = (R(1), R(2), \ldots, R(N))$, which indicates which nodes
got infected up to some predefined temporal threshold $T$ (random variable or constant). The
random variable $R(i)$ is a Bernoulli random variable, which takes the value 1 if node $i$ got
infected before time $T$ from the start of the epidemic process and the value 0 otherwise.

Let us assume that we have observed one spatio-temporal epidemic propagation realization
$\vec{r}_*$ of the SIR process defined by $(p, q)$ and $T$, and we want to infer which nodes from
the set $S$ are the most likely source of realization $\vec{r}_*$ for the SIR process $(p, q)$ and
$T$. $S = \{\theta_1, \theta_2, \ldots, \theta_N\}$ is the finite set of possible source nodes, defined
by the set of nodes observed as infected, or infected and recovered, prior to moment $T$ in the
network.

In order to find the node or a small subset of infected nodes with the highest likelihood of
being the source of the epidemic spread, we pose the following maximum likelihood problem:

$$\hat{\theta} = \arg\max_{\theta \in S} P(\Theta = \theta \mid \vec{R} = \vec{r}_*),$$

where $S$ is the set of all possible sources of the epidemic.

By applying Bayes' theorem, we get the following expression:

$$\hat{\theta} = \arg\max_{\theta \in S} \frac{P(\vec{R} = \vec{r}_* \mid \Theta = \theta)\, P(\Theta = \theta)}{P(\vec{R} = \vec{r}_*)}.$$

If all sources $\theta \in S$ are a priori equally likely, this is equivalent to:

$$\hat{\theta} = \arg\max_{\theta \in S} P(\vec{R} = \vec{r}_* \mid \Theta = \theta).$$

Thus, the core of the source location estimation problem is the determination of the likelihood
of the observed epidemic realization being initiated at the source location $\theta$. We now proceed
with the description of the algorithms for determining the maximum likelihood for the observed
epidemic realization.

2.1. The Maximum Likelihood source estimator 

First, we give pseudo-code (Algorithm 1) for the original problem of maximum likelihood
source estimation, where the source can be any node from the set $S$. In principle, this treatment
can be extended to the problem of determining multiple sources, but the necessary extensions are
out of the scope of this work. Note that among the algorithm parameters $(G, p, q, \vec{r}_*, T, S, n)$
the parameter $n$ represents the number of random simulations from a single candidate for the
epidemic source node. In our framework, the number of random simulations $n$ is very important
for the accuracy and stability of the results, and it is also a major determinant of the running
time of the estimation procedure.

Algorithm 1 The Maximum Likelihood source estimator algorithm: $(G, p, q, \vec{r}_*, T, S, n)$

Input: Network structure $G$, SIR process parameters $(p, q)$, $S = \{\theta_1, \theta_2, \ldots, \theta_N\}$
a set of possible sources $\theta_j$, observed realization $\vec{r}_*$ ending at some temporal
threshold $T$, $n$ a number of simulations

for each $\theta_j \in S$ (a priori set of possible sources of epidemic) do
    Call likelihood estimation function $(G, p, q, \vec{r}_*, T, \theta_j, n)$
    Save $\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta_j)$
end for

Output 1: $\theta_k$ with maximum likelihood $\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta_k)$
Output 2: Ranked sources in $S = \{\theta_1, \theta_2, \ldots, \theta_N\}$ according to likelihoods
$\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta_k)$



Algorithm 1 is just a wrapper that calls the likelihood estimation function for each potential
source of the epidemic. We now proceed with the description of different algorithms for
calculating the likelihood $P(\vec{R} = \vec{r}_* \mid \Theta = \theta)$.
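
The wrapper character of Algorithm 1 can be made concrete with a short sketch. The function name `rank_sources` and the convention that the likelihood estimator is passed in as a callable of one argument (the candidate source) are hypothetical choices made here for illustration; any of the likelihood estimation functions of Section 2.3 could play that role.

```python
def rank_sources(candidates, likelihood_fn):
    """Score every candidate source and rank them by estimated likelihood.

    candidates:    list of candidate source nodes (the set S).
    likelihood_fn: callable theta -> estimated likelihood of the observed
                   realization given source theta (hypothetical interface).
    Returns (best candidate, ranked list, dict of scores).
    """
    if not candidates:
        raise ValueError("the candidate set S must not be empty")
    scores = {theta: likelihood_fn(theta) for theta in candidates}
    # Highest estimated likelihood first; ties keep the input order.
    ranked = sorted(candidates, key=lambda theta: scores[theta], reverse=True)
    return ranked[0], ranked, scores
```

The two return values `ranked[0]` and `ranked` correspond to Output 1 and Output 2 of the pseudo-code.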



2.2. Realization similarity matching 

Let us define the function $\varphi(\vec{r}_1, \vec{r}_2)$, which measures the similarity between
two epidemic realizations, or subgraphs of the underlying network, $\vec{r}_1$ and $\vec{r}_2$.

We first define a new random variable $\varphi(\vec{r}_*, \vec{R}_\theta)$, which measures the
$\varphi$ similarity between the fixed realization $\vec{r}_*$ and a random realization that comes
from the SIR process with the source $\theta$. We can calculate the unbiased estimator of its
cumulative distribution function as the empirical distribution function:

$$\hat{F}(x) = \hat{P}(\varphi(\vec{r}_*, \vec{R}_\theta) \le x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{[0,x]}\big(\varphi(\vec{r}_*, \vec{R}_{\theta,i})\big),$$

where $\mathbf{1}_{[0,x]}$ is a characteristic function defined as:

$$\mathbf{1}_{[0,x]}(y) = \begin{cases} 1 & : y \in [0, x], \\ 0 & : \text{otherwise}. \end{cases}$$

Then, its probability density function is calculated as:

$$\widehat{PDF}(x) = \frac{d}{dx}\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta\big(x - \varphi(\vec{r}_*, \vec{R}_{\theta,i})\big),$$

where $\delta(x)$ is the Dirac delta function.

The central limit theorem states that, pointwise, $\hat{F}(x)$ has an asymptotically normal
distribution. The rate at which this convergence happens is bounded by the Berry-Esseen theorem.
This implies that the rate of convergence is bounded by $O(1/\sqrt{n})$, where $n$ is the number of
random simulations.
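
A minimal sketch of the empirical distribution function, assuming the similarity scores from the $n$ simulations are already available as a plain list, could look as follows. The function names are hypothetical. The second helper returns the usual binomial standard error of the pointwise estimate, $\sqrt{\hat{F}(x)(1 - \hat{F}(x))/n}$, which shrinks on the order of $1/\sqrt{n}$.

```python
import math


def empirical_cdf(scores, x):
    """Fraction of simulated similarity scores that are <= x."""
    return sum(1 for s in scores if s <= x) / len(scores)


def ecdf_standard_error(scores, x):
    """Binomial standard error of the pointwise estimate F_hat(x)."""
    n = len(scores)
    f = empirical_cdf(scores, x)
    return math.sqrt(f * (1.0 - f) / n)
```

For example, with the four scores `[0.2, 0.5, 0.5, 0.9]` three of them are at most 0.5, so the estimate at `x = 0.5` is 0.75.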



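Next, we define two measures (XNOR and Jaccard) that are used to determine the similarity
$\varphi$. The first one is the binary NOT XOR function, $XNOR(\vec{r}_1, \vec{r}_2)$, which counts
the number of nodes whose state (infected or non-infected) agrees in realizations $\vec{r}_1$ and
$\vec{r}_2$:

$$XNOR(\vec{r}_1, \vec{r}_2) = \sum_{k \in V} \psi_{\odot}\big(\vec{r}_1(k), \vec{r}_2(k)\big),$$

where the function $\psi_{\odot}(m_1, m_2)$ is defined as:

$$\psi_{\odot}(m_1, m_2) = \begin{cases} 1 & : (m_1 = 1 \text{ and } m_2 = 1) \text{ or } (m_1 = 0 \text{ and } m_2 = 0), \\ 0 & : \text{otherwise}. \end{cases}$$

In other words, $\psi_{\odot}(m_1, m_2)$ is equal to one only if the node was infected in both
realizations or was not infected in either of them prior to the temporal threshold $T$. We also
define the function $\overline{XNOR}(\vec{r}_1, \vec{r}_2)$, which is the XNOR function normalized
over the total number of nodes: $\overline{XNOR}(\vec{r}_1, \vec{r}_2) = XNOR(\vec{r}_1, \vec{r}_2) \cdot N^{-1}$.

The second similarity measure is the well-known Jaccard measure, which in our case counts
the number of nodes infected in both $\vec{r}_1$ and $\vec{r}_2$, normalized by the number of nodes
infected in $\vec{r}_1$ or in $\vec{r}_2$:

$$Jaccard(\vec{r}_1, \vec{r}_2) = \frac{|\vec{r}_1 \cap \vec{r}_2|}{|\vec{r}_1 \cup \vec{r}_2|} = \frac{\sum_{k \in V} \psi_{\wedge}\big(\vec{r}_1(k), \vec{r}_2(k)\big)}{\sum_{k \in V} \psi_{\vee}\big(\vec{r}_1(k), \vec{r}_2(k)\big)},$$

where the function $\psi_{\wedge}(m_1, m_2)$ is defined as:

$$\psi_{\wedge}(m_1, m_2) = \begin{cases} 1 & : (m_1 = 1 \text{ and } m_2 = 1), \\ 0 & : \text{otherwise}, \end{cases}$$

and where the function $\psi_{\vee}(m_1, m_2)$ is defined as:

$$\psi_{\vee}(m_1, m_2) = \begin{cases} 1 & : (m_1 = 1 \text{ or } m_2 = 1), \\ 0 & : \text{otherwise}. \end{cases}$$

In the following text $\varphi_X(\vec{r}_1, \vec{r}_2)$ will denote the similarity calculated with
the $XNOR(\vec{r}_1, \vec{r}_2)$ function and $\varphi_J(\vec{r}_1, \vec{r}_2)$ will denote the
similarity calculated with the $Jaccard(\vec{r}_1, \vec{r}_2)$ function. In order to speed up the
similarity matching between realizations, we use bitwise operations (XOR, NOT, AND) and bit
counting with Brian Kernighan's method.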
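
The bitwise formulation can be illustrated with a short sketch in which each realization is stored as a Python integer whose k-th bit is 1 if node k was infected. This encoding, the function names, and the convention that two empty realizations have Jaccard similarity 1 are assumptions of the sketch, not details given in the text.

```python
def popcount_kernighan(x):
    """Count set bits with Brian Kernighan's method.

    Each iteration clears the lowest set bit, so the loop runs once per
    set bit rather than once per bit position.
    """
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count


def xnor_similarity(r1, r2, n_nodes):
    """Normalized XNOR: fraction of nodes whose state agrees in r1 and r2."""
    mask = (1 << n_nodes) - 1          # keep only the n_nodes lowest bits
    agree = ~(r1 ^ r2) & mask          # NOT XOR, restricted to real nodes
    return popcount_kernighan(agree) / n_nodes


def jaccard_similarity(r1, r2):
    """Jaccard: nodes infected in both, divided by nodes infected in either."""
    union = popcount_kernighan(r1 | r2)
    if union == 0:
        return 1.0  # assumption: two empty realizations count as identical
    return popcount_kernighan(r1 & r2) / union
```

For the four-node realizations `0b1011` and `0b1001`, the states agree on three of four nodes (XNOR similarity 0.75), and two of the three nodes infected in either realization are infected in both (Jaccard similarity 2/3).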

2.3. Likelihood estimation functions 

In this section we define three variants of likelihood estimation functions: AUCDF, AvgTopK,
and Naive Bayes. The first two functions, AUCDF and AvgTopK, can use any of the similarity
measures defined above, while Naive Bayes produces a likelihood based on its own similarity
measure.

As the first likelihood estimation function we define AUCDF (Area Under Cumulative
Distribution Function) (see Algorithm 2), which can use any of the similarity measures defined above.



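Algorithm 2 AUCDF estimation function $(G, p, q, \vec{r}_*, T, \theta, n)$

Input: $G$ - network structure, $(p, q)$ - SIR process parameters, $\vec{r}_*$ - observed realization
prior to some temporal threshold $T$, $\theta$ - source for which likelihood is calculated, $n$ - a
number of simulations

for $i = 1$ to $n$ (number of simulations) do
    - Run SIR simulation $(p, q)$ with $\Theta = \theta$ and obtain epidemic realization
      $\vec{R}_{\theta,i}$, ending at the temporal threshold $T$;
    - Calculate and save $\varphi(\vec{r}_*, \vec{R}_{\theta,i})$;
end for

- Calculate empirical distribution function:

    $$\hat{P}(\varphi(\vec{r}_*, \vec{R}_\theta) \le x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{[0,x]}\big(\varphi(\vec{r}_*, \vec{R}_{\theta,i})\big)$$

- Estimate likelihood using the area under the empirical cumulative distribution:

    $$AUCDF_\theta = \int_0^1 \hat{P}(\varphi(\vec{r}_*, \vec{R}_\theta) \le x)\, dx$$

Output: $\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta) = 1 - AUCDF_\theta$ likelihood for $\theta$;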



Different sources $\theta$ produce different empirical cumulative distributions of similarities to
$\vec{r}_*$. If we compare two empirical distribution functions $CDF_1$ and $CDF_2$ from two
different sources $\theta_1$ and $\theta_2$, and if $AUCDF_1 < AUCDF_2$, then the sample realizations
from source $\theta_1$ are more similar to the fixed realization $\vec{r}_*$ than the sample
realizations from source $\theta_2$. This is the primary reason why we use the value $1 - AUCDF$ to
estimate the source likelihood $P(\vec{R} = \vec{r}_* \mid \Theta = \theta)$.
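
A minimal sketch of the AUCDF score is given below, assuming the similarity scores are normalized to $[0, 1]$ and the area is taken over that interval (the integration limits are an assumption of this sketch). Because the empirical distribution function is a step function, the area can be computed exactly by summing rectangles between consecutive sorted scores. Under this assumption $1 - AUCDF$ coincides with the sample mean of the scores, since each indicator contributes $\int_0^1 \mathbf{1}_{[s_i \le x]}\,dx = 1 - s_i$. The function name is hypothetical.

```python
def aucdf_likelihood(scores):
    """Return 1 - (area under the empirical CDF on [0, 1]).

    scores: similarity values in [0, 1], one per simulation.
    """
    if not scores or any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores must be a non-empty list of values in [0, 1]")
    n = len(scores)
    area = 0.0
    prev = 0.0
    # Between the i-th and (i+1)-th sorted score the empirical CDF equals i/n.
    for i, s in enumerate(sorted(scores)):
        area += (s - prev) * (i / n)
        prev = s
    area += (1.0 - prev) * 1.0  # the CDF equals 1 beyond the largest score
    return 1.0 - area
```

For the scores `[0.2, 0.6]` the area under the step function is 0.6, so the score is 0.4, which is also the mean of the two similarities.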

Algorithm AvgTopK represents a variant of the previous estimation function, which uses only
the $k$ highest values from the tail of the probability density function of the random variable
$\varphi(\vec{r}_*, \vec{R}_\theta)$:

$$\widehat{PDF}(x) = \frac{d}{dx}\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta\big(x - \varphi(\vec{r}_*, \vec{R}_{\theta,i})\big)$$



Algorithm 3 AvgTopK likelihood estimation function $(G, p, q, \vec{r}_*, T, \theta, n)$

Input: $G$ - network structure, $(p, q)$ - SIR process parameters, $\vec{r}_*$ - observed realization
prior to some temporal threshold $T$, $\theta$ - source for which likelihood is calculated, $n$ - a
number of simulations

for $i = 1$ to $n$ (number of simulations) do
    - Run SIR simulation $(p, q)$ with $\Theta = \theta$ and obtain epidemic realization
      $\vec{R}_{\theta,i}$, ending at the temporal threshold $T$;
    - Calculate and save $\varphi(\vec{r}_*, \vec{R}_{\theta,i})$;
end for

- Sort the scores $\{\varphi(\vec{r}_*, \vec{R}_{\theta,i})\}$ in descending order;
- Average top $k$ highest scores:

    $$\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta) = \frac{1}{k} \sum_{i=1}^{k} \big\{\varphi(\vec{r}_*, \vec{R}_{\theta,i})\big\}_{sorted}$$

Output: $\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta)$ likelihood for $\theta$;



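In each simulation we calculate how similar the realization $\vec{R}_{\theta,i}$ is to the observed
realization $\vec{r}_*$ by using the $\varphi$ function. The estimate
$\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta)$ is the average score over the top $k$ highest
similarities $\varphi(\vec{r}_*, \vec{R}_{\theta,i})$ in $n$ simulations (tail of the pdf).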
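
The averaging step of AvgTopK is simple enough to sketch directly. The function name `avg_top_k` and the choice to raise an error when $k$ is outside $1..n$ are assumptions made for this illustration.

```python
def avg_top_k(scores, k):
    """Average of the k highest similarity scores.

    scores: similarity values, one per simulation.
    k:      number of top scores to average (1 <= k <= len(scores)).
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must satisfy 1 <= k <= number of scores")
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / k
```

If $k$ is specified as a fraction of the number of simulations, one possible (hypothetical) conversion is `k = max(1, int(fraction * len(scores)))`, which guarantees that at least one score is averaged.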

Finally, we propose the third likelihood estimation function, which is based on the node
probabilities of being infected from a particular source node. The main assumption of this approach
is independence between nodes with respect to epidemic spreading.

The conditional probability that the node $k$ in realization $\vec{r}_*$ is infected from source
$\theta$ is:

$$\hat{P}(\vec{r}_*(k) = 1 \mid \Theta = \theta) = \frac{m_k + \epsilon}{n + \epsilon}, \quad \forall k \in G,$$

where $m_k$ is the number of times that node $k$ got infected in the total of $n$ simulations
$SIR(p, q)$ from source node $\theta$ and $\epsilon$ is a smoothing factor. The smoothing factor
$\epsilon$ is necessary to mitigate the problem of zero values, stemming from the finite number of
simulations used to calculate $\hat{P}(\vec{r}_*(k) = 1 \mid \Theta = \theta)$.

Then we define the estimator for the likelihood of observing realization $\vec{r}_*$ from source
node $\theta$ as:

$$\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta) = \prod_{k:\, \vec{r}_*(k) = 1} \hat{P}(\vec{r}_*(k) = 1 \mid \Theta = \theta) \prod_{j:\, \vec{r}_*(j) = 0} \big(1 - \hat{P}(\vec{r}_*(j) = 1 \mid \Theta = \theta)\big).$$

This equation uses the probability estimates that the nodes $\{k : \vec{r}_*(k) = 1\}$ from
realization $\vec{r}_*$ got infected and the probability estimates that the nodes
$\{j : \vec{r}_*(j) = 0\}$ from realization $\vec{r}_*$ did not get infected from source node $\theta$.

In a mathematical sense, the probability of finding an infected node $k$ at time $t$ depends on
the other nodes infected prior to time $t$. Nevertheless, we use the assumption of independence
to estimate the rank of potential sources. There is an obvious resemblance between this approach
and the well-studied probabilistic classifier Naive Bayes. Although Naive Bayes uses a
strong assumption of independence, it has been shown that in practice its performance is
comparable to more complex probabilistic classifiers [9].

In order to obtain numerically more stable likelihood estimates, we use the log-likelihood
variant for estimating $\hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta)$ (see Algorithm 4).

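Algorithm 4 Naive Bayes likelihood estimation function $(G, p, q, \vec{r}_*, T, \theta, n)$

Input: $G$ - network structure, $(p, q)$ - SIR process parameters, $\vec{r}_*$ - observed realization
prior to some temporal threshold $T$, $\theta$ - initial source for which likelihood is calculated,
$n$ - a number of simulations

- $m_k = 0$, $\forall k \in V$ from $G$;
for $i = 1$ to $n$ (number of simulations) do
    - Run SIR simulation $(p, q)$ with $\Theta = \theta$ and obtain realization $\vec{R}_{\theta,i}$
      prior to the temporal threshold $T$;
    - Update: $m_k = m_k + 1$, $\forall k$ which were infected in $\vec{R}_{\theta,i}$;
end for

- Calculate:

    $$\hat{P}(\vec{r}_*(k) = 1 \mid \Theta = \theta) = \frac{m_k + \epsilon}{n + \epsilon}, \quad \forall k \in G$$

- Calculate log likelihood:

    $$\log \hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta) = \sum_{k:\, \vec{r}_*(k) = 1} \log \hat{P}(\vec{r}_*(k) = 1 \mid \Theta = \theta) + \sum_{j:\, \vec{r}_*(j) = 0} \log\big(1 - \hat{P}(\vec{r}_*(j) = 1 \mid \Theta = \theta)\big);$$

Output: $\log \hat{P}(\vec{R} = \vec{r}_* \mid \Theta = \theta)$ likelihood for $\theta$;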
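
The counting and log-likelihood steps of Algorithm 4 might be sketched as follows, assuming the simulated realizations have already been generated and are stored as lists of 0/1 values indexed by node. The function name, the data layout and the default value of the smoothing factor are hypothetical. Note that with the smoothing formula $(m_k + \epsilon)/(n + \epsilon)$ a node infected in all $n$ simulations still gets estimated probability 1; the sketch treats an impossible observation as log-likelihood $-\infty$, which is a choice made here and not stated in the text.

```python
import math


def naive_bayes_log_likelihood(observed, simulations, eps=1e-3):
    """Log-likelihood of the observed realization under node independence.

    observed:    list of 0/1 values, one per node (the realization r_*).
    simulations: list of simulated realizations in the same format,
                 all started from the same candidate source.
    eps:         smoothing factor (hypothetical default).
    """
    n = len(simulations)
    log_lik = 0.0
    for k, bit in enumerate(observed):
        m_k = sum(sim[k] for sim in simulations)  # times node k was infected
        p_k = (m_k + eps) / (n + eps)
        factor = p_k if bit == 1 else 1.0 - p_k
        if factor <= 0.0:
            # Assumption of this sketch: an observation with zero estimated
            # probability gives log-likelihood minus infinity.
            return float("-inf")
        log_lik += math.log(factor)
    return log_lik
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when the network has many nodes.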



3. Epidemic source location experiments 

In this section, we describe the experiments performed on different networks and in different
epidemic settings, along with the obtained results. The experiments were designed to illustrate
the overall predictability properties of the source detection problem with the introduced
inference framework and to compare the performance of the individual algorithms.

We test the performance of the source likelihood estimation algorithms on single-source epidemic
detection problems. In our experiments we observe one spatio-temporal epidemic propagation
realization $\vec{r}_*$ and we want to infer the potential source of the realization from the set
$S$. In Figure 1 we illustrate one epidemic realization on a synthetic grid, where the color
gradient from blue to red represents the estimated source likelihood (blue - lower, red - higher).
We have used a Naive SIR algorithm implementation [1] as an efficient SIR process simulation on
network structures.



Figure 1: One epidemic realization of SIR process (p = 0.3, q = 0.7) on a synthetic grid, where the color gradient from
blue to red represents estimated source likelihood (blue - lower and red - higher). Node with the letter "A" represents
true source of epidemic realization and the node with the letter "B" represents the Maximum Likelihood source estimate
by the "Naive Bayes" likelihood estimation function.







Due to the strongly stochastic nature of the epidemic process, the frequency of correct
estimations of the source location in the network is not the best measure of the predictability of
the algorithms. The topological distance of the maximum likelihood node from the true source can
be misleadingly low even for random estimations on networks with a low average shortest path.
Therefore, we measure the rank of the true source in the output list of potential sources from the
set $S$ in experiments on different network structures. The overall testing procedure is given in
the following pseudo-code.

Algorithm 5 Source location experiments

for experiment = 1 to total number of experiments do
    - Sample random initial source $\theta_*$ from network $G$
    - Obtain realization $\vec{r}_*$ from SIR process $SIR(p, q, \Theta = \theta_*, T)$ in which at
      least a fraction 0.01 of the nodes in the total network is infected
    - Create set $S$ as the set of all nodes which were infected in realization $\vec{r}_*$
    - Call the Maximum Likelihood source estimator algorithm $(G, p, q, \vec{r}_*, T, S, n)$
    - Measure the rank of the true source on the ranked likelihood list of $S$
end for

Let us assume that in some source location detection experiment we get a realization $\vec{r}_*$
that has $k$ infected nodes. We rank the nodes $\theta_j$ in a list of $k$ potential nodes according
to the likelihood $P(\vec{R} = \vec{r}_* \mid \Theta = \theta_j)$. We express the rank of the real
source as a relative source rank, i.e. the rank of the true source node normalized by the list size
(for example, if the true source node is at position 10 in the list of 100 potential sources, then
the relative source rank is 0.1). For the performed batch of experiments, we calculate the
cumulative source rank probability distribution, which tells us the probability that the relative
rank of the source node is lower than or equal to some specified value. By its nature the
cumulative source rank is very similar to the well-known receiver operating characteristic (ROC),
a measure frequently used in signal detection and machine learning for measuring the performance
of classifier systems. An ideal estimator or classifier would have an area under the cumulative
source rank equal to one, exactly as in the case of the ROC measure (the AUC measure represents
the area under the ROC curve). One can argue that other measures might have been appropriate as
well, for example the distance of the maximum likelihood node to the true source node in a network.
We opted for the cumulative source rank because it is a more versatile measure, due to its
invariance to network size and structure (e.g. for networks with different average shortest paths
one would get grossly different results with a distance-based measure).
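
The relative source rank and its cumulative distribution over a batch of experiments can be sketched as follows. The function names and the representation of the ranked output as a Python list (best candidate first) are assumptions of this illustration.

```python
def relative_source_rank(ranked_sources, true_source):
    """1-based position of the true source divided by the list size."""
    position = ranked_sources.index(true_source) + 1
    return position / len(ranked_sources)


def cumulative_rank_distribution(relative_ranks, thresholds):
    """For each threshold t, the fraction of experiments whose relative
    source rank is lower than or equal to t."""
    n = len(relative_ranks)
    return [sum(1 for r in relative_ranks if r <= t) / n for t in thresholds]
```

As in the example from the text, a true source at position 10 in a list of 100 candidates yields a relative source rank of 0.1.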

The influence of network structure on source localization performance has been tested on the
following classes of networks: regular grid (Figure 2) and lattice (Figure 8, part A), small-world
networks (Figure 8, part B), Erdős-Rényi networks (Figure 8, part C), Albert-Barabási network
(Figure 3, part A) and the Western States Power Grid of the United States [24] (Figure 3, part B).

In order to measure the performance of source localization we have done the following
experiments:

• Comparison of different estimators: we compare the performance of different algorithms for
different epidemic conditions,

• Network structure experiments: this set of experiments illustrates the effects of network
structure on the prediction performance over diverse network topologies,

• Process dynamics experiments: here we observe the effects of different process parameters
$(p, q, T)$ on source localization performance, and

• Uncertainty experiments: here we measure the performance degradation associated with uncertain
epidemic parameters or incomplete knowledge about the epidemic realization.



Comparison of different estimators 

In Figures 4 and 5 we can see the results of the source location detection experiment for
different likelihood estimation functions (AUCDF, AvgTopK and Naive Bayes) on different network
structures. The cumulative probability function in these experiments measures the probability of
ranking the true source at or before a specific relative position. These results suggest that the
Naive Bayes and AvgTopK estimators perform better than the AUCDF estimator. For instance, we can
see in Figure 5 that the Naive Bayes estimator ranks the true source in the top 10% of the source
list in approximately 80% of experiments. We have also made a baseline solution which uses a random
likelihood estimation function to rank the potential sources (see Figures 4 and 5). The random
likelihood estimation function returns a uniformly random probability value in [0, 1] for each
node. Note that the AvgTopK likelihood estimation function tends to give more accurate source
localization performance than the Naive Bayes and AUCDF estimation functions. In our experiments
we have used the top k = 5% of highest scores from the pdf in the AvgTopK likelihood estimation
function.

Figure 2: Visualization of regular grid of size N = 30 × 30.

Figure 3: Visualization of Albert-Barabási network (part A) of size N = 5000, with $m_0 = 5$ initial fully connected core,
and m = 1 added edges in preferential attachment. In part B: the visualization of power-grid network (Western States
Power Grid of the United States [24]) of size N = 4941.



Figure 4: Cumulative probability distribution of source relative rank based on 500 experiments with random initial source
on synthetic grid N = 30 × 30 for p = 0.3, q = 0.7, T = 10 with different likelihood estimation functions.

Figure 5: Cumulative probability distribution of source relative rank for 500 experiments with random initial source on
power grid network of size N = 4941 for p = 0.7, q = 0.6, T = 7 and different likelihood estimation functions.

Figure 6: Cumulative probability distribution of source relative rank for 100 experiments with random initial source on
Albert-Barabási network (N = 5000, $m_0$ = 5, m = 1) for p = 0.6, q = 0.2, T = 5 and different likelihood estimation
functions.

Network structure experiments

The effects of different network structures on source estimation performance are demonstrated
with the following small-world experiment. We generate networks ranging from a regular lattice
($\beta = 0$) to random networks ($\beta = 1$), with small-world networks in the middle, and observe
the performance of the source estimators. We measure the area under the cumulative source rank
function and observe that the performance of the source estimator drops as the average shortest
path of the network decreases.



Figure 7: Source location aggregate performance value: area under the cumulative probability of relative source rank
(AvgTopK estimator with $\varphi_X$ similarity function) for 100 experiments on classes of networks (size: N = 5000) from
regular lattice ($\beta = 0$) to random networks ($\beta = 1$) with small-world networks in the middle. SIR process has
parameters p = 0.1, q = 0.8 and T = 1. Average shortest path is normalized by average shortest path (≈ 120) in regular
lattice. Average clustering coefficient is normalized by average clustering coefficient (≈ 0.7) in regular lattice.

Figure 8: Classes of networks are generated according to the Watts-Strogatz small-world $\beta$ model (size: N = 5000)
from the regular lattice ($\beta = 0$ and 10 local edges) to random networks ($\beta = 1$) with small-world networks in the
middle. Visualization is done on smaller networks (size: 50) from regular lattice to random networks (part C: $\beta = 1$)
with small-world networks in the middle (e.g. part B: $\beta = 0.1$).

Process dynamics experiments

Finally, we perform a set of experiments to put our source location inference framework
into perspective with recent models for diffusion-like processes published in the literature [21],
[18]. We illustrate the performance of our inference framework on diffusion-like processes, which
can be understood as a limiting case of the SIR process in which the recovery parameter $q$ is
close or equal to zero. In Figures 9 and 10 we can observe that the performance of the source
estimation algorithms is highest under these conditions. This is expected behaviour, which can be
interpreted as a consequence of the initial conditions being better preserved in diffusion-like
processes.



Figure 9: Cumulative probability of relative source rank for 100 experiments with random initial source on power-grid 
network {N = 4941) for different parameters q. Diffusion like processes are special case of SIR model where recovery 
parameter ^ = (red line). Experiments were performed with AUCDF likelihood estimation function with ipx similarity 
function 
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Figure 10: Cumulative probability of relative source rank for 100 experiments with random initial source on the Albert- 
Barabasi network (N = 5000, Mo = 5,m = 1) for diiferent parameters q. Diifusion like processes are special case of 
SIR model where recovery parameter q = (red hne). Experiments were performed with AUCDF likelihood estimation 
function with <fj similarity function 
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Uncertainty experiments 

Note that the previous experiments were performed on processes for which the parameters 
p, q and T were degenerate random variables, i.e. constants. Now we demonstrate the effect 
on performance when the exact values of p, q and T are sampled from probability distributions. 
We model the temporal threshold T as a random variable of the form T = T_0 + ε, 
where ε represents noise drawn from some probability distribution. In Figure 11 we show a 
series of experiments in which the noise ε was modelled with the Geometric distribution with 
different parameters. As the variance of the noise increases, the performance of source localization 
decreases. We have also made a series of experiments in which the parameters (p, q) were 
modelled with noise as well: p = p_0 + y, q = q_0 + y, where the noise y follows a Normal 
distribution with parameters (μ, σ). In Figure 12 we observe that the performance of source 
localization decreases as the noise in the parameters p, q and T increases. These findings suggest that 
if the predictability is low for parameters p, q and T with no noise, then predictability can only 
be lower when noise is present. Furthermore, this implies certain limits of predictability for 
source localization on small-world networks. 
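
The noise model above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `sample_geometric` and `sample_parameters` are hypothetical, the geometric noise is taken to have support 0, 1, 2, ..., the noise terms for p and q are drawn independently, and the perturbed probabilities are clipped to [0, 1]. None of these choices is specified in the text.

```python
import random


def sample_geometric(rng, success_prob):
    """Number of failures before the first success (support 0, 1, 2, ...).

    Assumes 0 < success_prob <= 1; the loop would not terminate for success_prob = 0.
    """
    count = 0
    while rng.random() >= success_prob:
        count += 1
    return count


def sample_parameters(rng, p0, q0, T0, geom_param, sigma, mu=0.0):
    """Draw one noisy parameter triple (p, q, T) around the nominal values (p0, q0, T0)."""
    eps = sample_geometric(rng, geom_param)  # additive noise on the temporal threshold
    # Assumption: independent Normal(mu, sigma) noise for p and q, clipped to valid probabilities
    p = min(1.0, max(0.0, p0 + rng.gauss(mu, sigma)))
    q = min(1.0, max(0.0, q0 + rng.gauss(mu, sigma)))
    return p, q, T0 + eps


# Example with the nominal values quoted for the power-grid experiments.
rng = random.Random(42)
p, q, T = sample_parameters(rng, p0=0.7, q0=0.4, T0=10, geom_param=0.5, sigma=0.05)
```

Each simulated realization would then use its own draw of (p, q, T), so larger noise variance spreads the simulated realizations over a wider range of process dynamics.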
Figure 11: Cumulative probability of relative source rank for 300 experiments with random initial source on the 
power-grid network (N = 4941) for p = 0.7, q = 0.4, T = T_0 + ε, where ε ~ Geometric distribution with different 
parameters and T_0 = 15. 
Figure 12: Cumulative probability of relative source rank for 100 experiments with random initial source on the 
power-grid network (N = 4941) for p = p_0 + y, q = q_0 + y, T = T_0 + ε, where T_0 = 10, p_0 = 0.7, q_0 = 0.4, ε ~ 
Geometric distribution with parameter 0.5 and y ~ Normal distribution with parameters (μ = 0, σ = 0.05). 
In order to demonstrate the applicability of the statistical inference framework to general types 
of compartment contagion processes, we have performed localization experiments with the 
information/rumor spreading ISS (ignorant-spreading-stifler) model. The ISS model divides the 
individuals into three groups: ignorants, who have not heard the information/rumor; spreaders, who are 
propagating the information/rumor to ignorants; and stiflers, who know the information/rumor but 
are no longer propagating it. The probability of spreading the information/rumor from a spreader 
to an ignorant is α in one discrete time step. If a spreader interacts with another spreader or a stifler, it 
turns into a stifler with probability β. The infected nodes in the SIR model recover according 
to their internal state, contrary to the ISS model, where spreaders become stiflers according to 
the states of their neighbours. In Figure 13 we can observe the localization performance of the inference 
framework on the ISS model on a regular grid for different parameters (α, β). Even in the case when only a 
fraction of random nodes in the network can be observed, the statistical inference framework can 
localize the initial source (see Figure 13). 
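
The ISS rules described above can be sketched in a few lines. This is a minimal illustration, assuming a synchronous update in which every spreader contacts all of its neighbours in each time step; the function name `simulate_iss`, the state constants and the update scheme are assumptions and may differ from the implementation used in the experiments.

```python
import random

IGNORANT, SPREADER, STIFLER = 0, 1, 2


def simulate_iss(adj, source, alpha, beta, T, rng):
    """Sketch of a discrete-time ISS (ignorant-spreading-stifler) process.

    adj   : dict mapping each node to a list of its neighbours
    alpha : probability that a spreader passes the rumor to an ignorant neighbour
    beta  : probability that a spreader becomes a stifler after contacting
            another spreader or a stifler
    Returns a dict mapping each node to its state after T steps.
    """
    state = {node: IGNORANT for node in adj}
    state[source] = SPREADER
    for _ in range(T):
        new_state = dict(state)  # synchronous update: read old states, write new ones
        for node, s in state.items():
            if s != SPREADER:
                continue
            for nb in adj[node]:
                if state[nb] == IGNORANT:
                    if rng.random() < alpha:
                        new_state[nb] = SPREADER
                elif rng.random() < beta:
                    # The neighbour already knows the rumor, so the spreader may stop spreading
                    new_state[node] = STIFLER
        state = new_state
    return state


# Path graph 0-1-2-3-4 with deterministic parameters for a quick check.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 5] for i in range(5)}
final = simulate_iss(path, source=0, alpha=1.0, beta=1.0, T=2, rng=random.Random(0))
```

Unlike the SIR recovery, which depends only on the node itself, the transition to the stifler state here is triggered by the states of the neighbours, which is the distinction drawn in the paragraph above.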



Figure 13: Cumulative probability of relative source rank for 100 experiments with random initial source on the 
regular grid of size N = 30 × 30 for the ISS spreading process for different parameters α, β, T = 50 and different fractions 
of observed nodes (100 % of the realization, or 80 % or 60 % of random nodes in a realization) in the network. 
4. Related work 



Although the research on epidemic processes on complex networks is very mature, the 
problem of epidemic source detection was formulated only recently. Various researchers have 
proposed different solutions to the problem of epidemic source detection, which are based on a number 
of assumptions about contact network structures and spreading models. 

Zaman et al. formulated a problem in which a rumor spreads according to the SI model over a network 
structure for some unknown amount of time, and information about which nodes got infected is then 
observed. They raise the question of who is the most likely source of the rumor and when it can be 
found. As a solution to the problem of source detection, they developed the rumor centrality 
measure, which is the maximum likelihood estimator for regular trees under the SI model. They 
also obtained various theoretical results about the detection probability on different classes of 
trees [21], [22]. However, when the rumor spreads on general graphs, they use the simple 
heuristic that the rumor spreads along a breadth-first search tree rooted at the source. Dong et al. 
also studied the problem of rooting the rumor source with the SI model and demonstrated similar 
results on the asymptotic source detection probability on regular tree-type networks [6]. Comin et 
al. studied and compared different measures such as degree, betweenness, closeness and eigenvector 
centrality as estimators for source detection [5]. Pinto et al. formulated a similar problem 
of locating the source of diffusion in networks from sparsely placed observers [18]. They also 
assume that the diffusion tree is a breadth-first search tree, that the spreading model has no recovery, 
and that the exact directions and times of infection transfers are known. Spectral algorithms for detecting 
the initial seed of nodes that best explains a given snapshot under the SI model have also been derived [19]. 

Zhu et al. adopted the SIR model and proposed a sample path counting approach for source 
detection [25]. They prove that the source node on infinite trees minimizes the maximum distance 
to the infected nodes. They assume that the susceptible and recovered nodes are indistinguishable. 
Lokhov et al. use a dynamic message-passing algorithm to estimate the probability that a given 
node produces an observed snapshot. They use a mean-field-like approximation to compute the 
marginal probabilities and assume a sparse contact network [13]. 
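
The minimum-of-maximum-distance criterion mentioned above can be illustrated on a finite graph with breadth-first search. The sketch below is not the sample path counting algorithm of [25]; it only evaluates the distance criterion, and the function names, the adjacency-dictionary representation and the choice to consider every node as a candidate are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adj, start):
    """Hop distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def min_max_distance_node(adj, infected):
    """Return the node whose largest distance to any infected node is smallest.

    Ties are broken by node order, so node labels are assumed to be sortable.
    """
    best_node, best_radius = None, float("inf")
    for candidate in sorted(adj):
        dist = bfs_distances(adj, candidate)
        # Unreachable infected nodes count as infinitely far away
        radius = max(dist.get(v, float("inf")) for v in infected)
        if radius < best_radius:
            best_node, best_radius = candidate, radius
    return best_node, best_radius


# Path graph 0-1-2-3-4 with infected nodes {1, 2, 3}: the middle node is selected.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 5] for i in range(5)}
estimate = min_max_distance_node(path, infected={1, 2, 3})
```

This estimator uses only the graph structure and the set of infected nodes, which is the kind of structural assumption that the simulation-based framework of this paper does not rely on.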



Contrary to these approaches, our source estimation approach reduces the assumptions on 
network structures and spreading process properties. Our statistical inference framework 
works on arbitrary network structures and with arbitrary compartment spreading processes 
(SI, SIR, SEIR, ISS, etc.). 

5. Conclusion 

In this paper we have constructed a statistical framework for detecting the source location of 
an epidemic or rumour spread from a single realization of a stochastic contagion model on an 
arbitrary network. Detecting the source of an epidemic or rumour spreading under a stochastic 
discrete SIR model represents an extension of existing research methodologies, which are mainly 
focussed on diffusion-like processes. Furthermore, this statistical framework can be deployed for different 
kinds of stochastic compartment processes (ISS, SI, SIR, SEIR) on networks whose dynamical 
patterns can be described by probability distributions over similarities among realization vectors. 
We have also demonstrated that the assumptions of complete knowledge of the epidemic realization 
and of exactly known contagion process parameters and time can be relaxed to allow for uncertainty. 